add parameter to coerce non-numeric values to NaN during validation #4

Open

diegoquintanav
Contributor

Maybe the errors parameter of pd.to_numeric should be exposed, but for now I set it to coerce by default, so all non-numeric values are converted to np.nan. Without this setting, validation raises an error if a string is found in a column of ints or floats.

Please let me know what you think
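
For readers unfamiliar with the setting, here is a minimal sketch of the difference between the default behaviour of pd.to_numeric and errors="coerce". The sample values are made up for illustration and do not come from the PR.

```python
import pandas as pd

# Illustrative data (an assumption, not from the PR): numeric strings plus one stray word.
series = pd.Series(["2008", "2016", "unknown"])

# Default behaviour (errors="raise"): the stray string aborts the whole conversion.
try:
    pd.to_numeric(series)
    raised = False
except ValueError:
    raised = True

# errors="coerce": entries that cannot be parsed become NaN instead of raising.
coerced = pd.to_numeric(series, errors="coerce")

print(raised)
print(coerced.isna().tolist())
```

With coercion, the bad entry survives as a missing value that later checks can flag, rather than stopping validation of the whole column.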

@diegoquintanav
Contributor Author

@TMiguelT what do you think about this?

@multimeric
Owner

This seems reasonable. Could you add a test that currently breaks, i.e. a column containing non-numeric data?

@diegoquintanav
Contributor Author

Sure, I will during the week, if that's okay with you.

@diegoquintanav
Contributor Author

diegoquintanav commented Aug 1, 2018

This is really old, but I ran into this again:

In [85]: df["my_column"].unique()
Out[85]: 
array(['nan', '2008', '2016', '2015', '2014', '2013', '2012', '2010',
       '2011', '2009', '2017'], dtype=object)

Say we have a simple schema (named dictionary here):

dictionary = ps.Schema([
    ps.Column('my_column', [ps.validation.InRangeValidation(1900, 3000)]),
])
In [86]: errors = dictionary.validate(df, columns=["my_column"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "nan"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-86-2a1c78e8916b> in <module>()
----> 1 errors = dictionary.validate(df, columns=["my_column"])

~/Code/lap-etl-project/venv/lib/python3.7/site-packages/pandas_schema/schema.py in validate(self, df, columns)
     83         # Iterate over each pair of schema columns and data frame series and run validations
     84         for series, column in column_pairs:
---> 85             errors += column.validate(series)
     86
     87         return sorted(errors, key=lambda e: e.row)

~/Code/lap-etl-project/venv/lib/python3.7/site-packages/pandas_schema/column.py in validate(self, series)
     25         :return: An iterable of ValidationError instances generated by the validation
     26         """
---> 27         return [error for validation in self.validations for error in validation.get_errors(series, self)]

~/Code/lap-etl-project/venv/lib/python3.7/site-packages/pandas_schema/column.py in <listcomp>(.0)
     25         :return: An iterable of ValidationError instances generated by the validation
     26         """
---> 27         return [error for validation in self.validations for error in validation.get_errors(series, self)]

~/Code/lap-etl-project/venv/lib/python3.7/site-packages/pandas_schema/validation.py in get_errors(self, series, column)
     82         # Calculate which columns are valid using the child class's validate function, skipping empty entries if the
     83         # column specifies to do so
---> 84         simple_validation = ~self.validate(series)
     85         if column.allow_empty:
     86             # Failing results are those that are not empty, and fail the validation

~/Code/lap-etl-project/venv/lib/python3.7/site-packages/pandas_schema/validation.py in validate(self, series)
    205
    206     def validate(self, series: pd.Series) -> pd.Series:
--> 207         series = pd.to_numeric(series)
    208         return (series >= self.min) & (series < self.max)
    209

~/Code/lap-etl-project/venv/lib/python3.7/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    131             coerce_numeric = False if errors in ('ignore', 'raise') else True
    132             values = lib.maybe_convert_numeric(values, set(),
--> 133                                                coerce_numeric=coerce_numeric)
    134
    135     except Exception:

pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "nan" at position 0

I believe line 207 is responsible: pd.to_numeric raises an error if it can't convert a value to a number, and here that includes the string "nan". The errors parameter should be exposed, or set to coerce, to avoid unwanted exceptions.
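
A rough sketch of what exposing the parameter could look like. The function in_range below is a hypothetical stand-in for the two-line validate method shown at lines 206-208 of the traceback, not the library's actual code, and the errors argument with its "coerce" default is an assumption about how the fix might be shaped.

```python
import pandas as pd


def in_range(series, min_value, max_value, errors="coerce"):
    """Hypothetical stand-in for the range check in the traceback.

    `errors` is passed straight through to pd.to_numeric, so callers can
    choose between "raise" and "coerce".
    """
    numeric = pd.to_numeric(series, errors=errors)
    # Any comparison with NaN is False, so coerced entries fail the check
    # instead of raising.
    return (numeric >= min_value) & (numeric < max_value)


# Values modelled on the column from the report, plus one out-of-range year.
years = pd.Series(["nan", "2008", "2016", "1850"], dtype=object)
result = in_range(years, 1900, 3000)
print(result.tolist())
```

With this shape, a non-numeric entry is reported as failing the range check rather than aborting the whole validation run; passing errors="raise" would keep the stricter behaviour for callers who want it.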

@multimeric
Owner

Please just make a test case out of your example and I'll be happy to accept the PR.
